84 research outputs found

    On feature selection protocols for very low-sample-size data

    High-dimensional data with very few instances are typical in many application domains. Selecting a highly discriminative subset of the original features is often the main interest of the end user. The widely used feature selection protocol for this type of data consists of two steps. First, features are selected from the data (possibly through cross-validation), and, second, a cross-validation protocol is applied to test a classifier using the selected features. The selected feature set and the testing accuracy are then returned to the user. For lack of a better option, the same low-sample-size dataset is used in both steps. Questioning the validity of this protocol, we carried out an experiment using 24 high-dimensional datasets, three feature selection methods and five classifier models. We found that the accuracy returned by the above protocol is heavily biased, and therefore propose an alternative protocol which avoids the contamination by including both steps in a single cross-validation loop. Statistical tests verify that the classification accuracy returned by the proper protocol is significantly closer to the true accuracy (estimated from an independent testing set) than that returned by the currently favoured protocol. Funding: project RPG-2015-188 funded by The Leverhulme Trust, UK, and project TIN2015-67534-P (MINECO/FEDER, UE) funded by the Ministerio de Economía y Competitividad of the Spanish Government and the European Union FEDER fund.
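
The difference between the two protocols in this abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the data are pure noise with arbitrary labels (so the true accuracy is 0.5), features are ranked by the absolute difference of class means, the classifier is a nearest-centroid rule, and the cross-validation is leave-one-out. The helper names `select_top_k`, `nearest_centroid_predict` and `loo_accuracy` are hypothetical.

```python
import random


def select_top_k(X, y, k):
    """Rank features by |mean(class 1) - mean(class 0)| and return the top-k indices."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]


def nearest_centroid_predict(X_train, y_train, x, features):
    """Assign x to the class whose centroid (on the chosen features) is closest."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        rows = [r for r, l in zip(X_train, y_train) if l == label]
        centroid = [sum(r[j] for r in rows) / len(rows) for j in features]
        dist = sum((x[j] - c) ** 2 for j, c in zip(features, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loo_accuracy(X, y, k, select_inside_loop):
    """Leave-one-out accuracy with selection either inside or outside the loop."""
    # Outside the loop: features are chosen once, using the held-out points too.
    fixed = None if select_inside_loop else select_top_k(X, y, k)
    correct = 0
    for i in range(len(X)):
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        # Inside the loop: the held-out point never influences the selection.
        feats = select_top_k(X_tr, y_tr, k) if select_inside_loop else fixed
        correct += nearest_centroid_predict(X_tr, y_tr, X[i], feats) == y[i]
    return correct / len(X)


rng = random.Random(0)
n, p, k = 20, 1000, 10  # few instances, many features (illustrative sizes)
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [i % 2 for i in range(n)]  # labels carry no information about X

biased = loo_accuracy(X, y, k, select_inside_loop=False)  # two-step protocol
proper = loo_accuracy(X, y, k, select_inside_loop=True)   # single-loop protocol
print(f"two-step: {biased:.2f}, single loop: {proper:.2f}")
```

Because the labels are unrelated to the features, any estimate well above 0.5 reflects selection leaking information from the held-out points, which is the contamination the abstract describes.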

    Selection Of Independent Binary Features Using Probabilities: An Example From Veterinary Medicine

    Supervised classification into c mutually exclusive classes based on n binary features is considered. The only information available is an n×c table of probabilities. Knowing that the best d features are not the d best, simulations were run for four feature selection methods, and an application to diagnosing BSE in cattle and scrapie in sheep is presented.

    Classifier ensembles for fMRI data analysis: an experiment

    Functional magnetic resonance imaging (fMRI) is becoming a forefront brain-computer interface tool. To decipher brain patterns, fast, accurate and reliable classifier methods are needed. The support vector machine (SVM) classifier has been traditionally used. Here we argue that state-of-the-art methods from pattern recognition and machine learning, such as classifier ensembles, offer more accurate classification. This study compares 18 classification methods on a publicly available real data set due to Haxby et al. [Science 293 (2001) 2425-2430]. The data come from a single-subject experiment, organized in 10 runs where eight classes of stimuli were presented in each run. The comparisons were carried out on voxel subsets of different sizes, selected through seven popular voxel selection methods. We found that, while SVM was robust, accurate and scalable, some classifier ensemble methods demonstrated significantly better performance. The best classifiers were found to be the random subspace ensemble of SVM classifiers, rotation forest and ensembles with random linear and random spherical oracle.

    Combining univariate approaches for ensemble change detection in multivariate data

    Detecting change in multivariate data is a challenging problem, especially when class labels are not available. There is a large body of research on univariate change detection, notably in control charts developed originally for engineering applications. We evaluate univariate change detection approaches, including those in the MOA framework, built into ensembles where each member observes a feature in the input space of an unsupervised change detection problem. We present a comparison between the ensemble combinations and three established ‘pure’ multivariate approaches over 96 data sets, and a case study on the KDD Cup 1999 network intrusion detection dataset. We found that ensemble combination of univariate methods consistently outperformed multivariate methods on the four experimental metrics. Funding: project RPG-2015-188 funded by The Leverhulme Trust, UK; the Spanish Ministry of Economy and Competitiveness through project TIN2015-67534-P; and the Spanish Ministry of Education, Culture and Sport through Mobility Grant PRX16/00495. The 96 datasets were originally curated for use in the work of Fernández-Delgado et al. [53] and accessed from the personal web page of the author. The KDD Cup 1999 dataset used in the case study was accessed from the UCI Machine Learning Repository [10].
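
The idea of an ensemble in which each member watches one feature can be sketched as follows. This is only an illustration under stated assumptions, not the detectors or the combination rule evaluated in the paper: each member is a simple Shewhart-style check on the window mean against a reference sample, and the ensemble signals change when at least a chosen fraction of members do. The names `feature_flags_change`, `ensemble_detects_change`, `n_sigmas` and `min_fraction` are hypothetical.

```python
from statistics import mean, stdev


def feature_flags_change(reference, window, n_sigmas=3.0):
    """Univariate member: flag change if the window mean leaves the control limits."""
    mu, sigma = mean(reference), stdev(reference)
    # Control limit for the mean of len(window) observations.
    limit = n_sigmas * sigma / len(window) ** 0.5
    return abs(mean(window) - mu) > limit


def ensemble_detects_change(ref_rows, win_rows, min_fraction=0.5):
    """Ensemble: one member per feature, combined by a vote threshold (assumed rule)."""
    n_features = len(ref_rows[0])
    votes = 0
    for j in range(n_features):
        ref_col = [row[j] for row in ref_rows]
        win_col = [row[j] for row in win_rows]
        votes += feature_flags_change(ref_col, win_col)
    return votes / n_features >= min_fraction


# Deterministic toy data: three features oscillating around zero.
reference = [[(-1) ** i, 2 * (-1) ** i, 0.5 * (-1) ** i] for i in range(20)]
stable_window = reference[:10]
# Shift the first two features by +5; the third is left unchanged.
shifted_window = [[a + 5, b + 5, c] for a, b, c in stable_window]

no_change = ensemble_detects_change(reference, stable_window)
change = ensemble_detects_change(reference, shifted_window)
print(no_change, change)
```

No class labels are used anywhere, which matches the unsupervised setting described above. Other univariate detectors could be substituted for `feature_flags_change` without altering the ensemble wrapper.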